VRVis - ComVis - VAST 2010 mini challenge 3

VRVis - ComVis

VAST 2010 Challenge
Genetic Sequences – Tracing the Mutations of a Disease

Authors and Affiliations:

Zoltán Konyha, VRVis, konyha@vrvis.at [PRIMARY contact]
Andreas Ammer, VRVis, ammer@vrvis.at
Krešimir Matković, VRVis, matkovic@vrvis.at
Çağatay Turkay, University of Bergen, Cagatay.Turkay@ii.uib.no
Denis Gračanin, Virginia Tech, gracanin@vt.edu

Tool(s):

We developed a Python script that preprocesses the data set and computes derived data for pairs of sequences. There are 10 native and 58 current sequences in this challenge, so 68 × 67 = 4556 ordered pairs can be constructed, each representing a possible mutation in the evolution of the virus. Each pair is one item in our derived data set. Properties of each pair include:

· ID1, CATEGORY1: name and category ("native" vs. "current") of the original sequence.

· ID2, CATEGORY2: name and category of the mutated sequence.

· DIFFCOUNT: number of differences between sequences ID1 and ID2.

· DIFFBASES: set of base substitutions that mutate sequence ID1 into ID2.

For "current" sequences we added disease characteristics, too.
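The derived pair table described above could be computed roughly as follows. This is a minimal sketch, not our actual script: it assumes that all sequences are aligned and of equal length, the function names and the toy sequences are hypothetical, and substitutions are labelled as position, original base, new base (for example 223AG).

```python
from itertools import permutations

def diff_bases(seq1, seq2):
    """Return the substitutions that turn seq1 into seq2 as (position, from, to).

    Positions are 1-based. Assumes both sequences are aligned (same length).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq1, seq2), start=1)
            if a != b]

def build_pairs(sequences, categories):
    """Build one row per ordered pair (ID1 -> ID2) of distinct sequences."""
    rows = []
    for id1, id2 in permutations(sequences, 2):
        subs = diff_bases(sequences[id1], sequences[id2])
        rows.append({
            "ID1": id1, "CATEGORY1": categories[id1],
            "ID2": id2, "CATEGORY2": categories[id2],
            "DIFFCOUNT": len(subs),
            "DIFFBASES": {f"{pos}{a}{b}" for pos, a, b in subs},
        })
    return rows

# Toy data, invented for illustration only.
toy_sequences = {"N1": "ACGT", "C1": "ACGA", "C2": "TCGA"}
toy_categories = {"N1": "native", "C1": "current", "C2": "current"}
pairs = build_pairs(toy_sequences, toy_categories)
```

With n sequences this yields n × (n − 1) rows, which matches the 68 × 67 pairs in the challenge data.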

We explored this data set in ComVis, our interactive visualization application built on multiple linked views. ComVis offers several types of views for scalar, categorical and set-typed data. Each view is interactive and brushable. Brushes defined in the same view or in different views can be combined using Boolean operators. The visual analysis context can be captured in session files; exchanging them made collaboration easier among our team members, who are distributed across several cities.

ComVis supports visualization of set-typed data in histograms. The histogram includes one vertical bar for each possible element of the set-typed dimension. Each item contributes to the bars of all elements contained in its set.
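The counting behind such a set-typed histogram can be illustrated with a short sketch. It is not ComVis code; the function name and the toy substitution labels are hypothetical.

```python
from collections import Counter

def set_histogram(items):
    """Count, for every element, the number of items whose set contains it."""
    counts = Counter()
    for elements in items:
        # Each item adds one to the bar of every element in its set.
        counts.update(set(elements))
    return counts

# Toy DIFFBASES values for three pairs (invented labels).
toy_items = [{"5AG", "7CT"}, {"5AG"}, set()]
hist = set_histogram(toy_items)
```

Note that the bar heights can sum to more or less than the number of items, because an item with k elements contributes to k bars.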

In our initial attempts at MC3.3 and MC3.4 we used Jalview, a multiple sequence alignment editor.

Video:

Download video (11.3 MB)

ANSWERS:

MC3.1: What is the region or country of origin for the current outbreak? Please provide your answer as the name of the native viral strain along with a brief explanation.

Nigeria_B

Explanation:

The origin of the current outbreak can be found by identifying the native sequence that is most similar to the current ones. Figures 1.1 and 1.2 illustrate how the logical AND of three brushes reveals this information.

Figure 1.1: Top left: pairs where the initial sequence is "native" are brushed (red rectangle). Top, middle: items where the mutated sequence is "current" are brushed. Top right: each point in this scatter plot represents one pair of sequences. ID1 is on the horizontal axis, ID2 is on the vertical axis. Bottom: histogram of the number of differences.

Figure 1.2: The red brush in the bottom histogram narrows the focus to pairs with few differences. The red highlighted points in the scatter plot indicate that all of the current strains are similar to Nigeria_B. The brushed pairs are shown in the tabular view at the bottom, too.
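The logical AND of the three brushes can be mimicked outside ComVis as a conjunction of filters on the pair table. The sketch below uses invented rows and a hypothetical threshold of three differences; the names NativeA, NativeB, c1 and c2 are placeholders, not challenge data.

```python
from collections import Counter

# Toy pair table (invented values).
rows = [
    {"ID1": "NativeA", "CATEGORY1": "native", "ID2": "c1", "CATEGORY2": "current", "DIFFCOUNT": 9},
    {"ID1": "NativeB", "CATEGORY1": "native", "ID2": "c1", "CATEGORY2": "current", "DIFFCOUNT": 1},
    {"ID1": "NativeA", "CATEGORY1": "native", "ID2": "c2", "CATEGORY2": "current", "DIFFCOUNT": 8},
    {"ID1": "NativeB", "CATEGORY1": "native", "ID2": "c2", "CATEGORY2": "current", "DIFFCOUNT": 2},
    {"ID1": "c1", "CATEGORY1": "current", "ID2": "c2", "CATEGORY2": "current", "DIFFCOUNT": 1},
]

def brush_and(rows, *predicates):
    """Keep the rows that satisfy every predicate (logical AND of brushes)."""
    return [r for r in rows if all(p(r) for p in predicates)]

selected = brush_and(
    rows,
    lambda r: r["CATEGORY1"] == "native",    # brush 1: original sequence is native
    lambda r: r["CATEGORY2"] == "current",   # brush 2: mutated sequence is current
    lambda r: r["DIFFCOUNT"] <= 3,           # brush 3: few differences (assumed threshold)
)

# Which native strains remain in focus, and for how many current strains?
origin_counts = Counter(r["ID1"] for r in selected)
```

If a single native strain remains for all current strains, as NativeB does in this toy table, it is the likely origin.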

MC3.2: Over time, the virus spreads and the diversity of the virus increases as it mutates. Two patients infected with the Drafa virus are in the same hospital as Nicolai. Nicolai has a strain identified by sequence 583. One patient has a strain identified by sequence 123 and the other has a strain identified by sequence 51. Assume only a single viral strain is in each patient. Which patient likely contracted the illness from Nicolai and why? Please provide your answer as the sequence number along with a brief explanation.

123

Explanation:

We assume that the person who contracted the illness from Nicolai has a strain that is more similar to Nicolai's. Therefore, we need to find out whether sequence 123 or 51 is more similar to 583. Sequences 583 and 123 differ in one position (see Figure 2.1). Strains 583 and 51 differ in three positions (see Figure 2.2). The patient with sequence 123 is more likely to have contracted the illness from Nicolai.

Figure 2.1: Top left: select the pairs of sequences where ID1 is 583 (Nicolai's sequence). Top right: select pairs where ID2 is 123. The histogram at the bottom and the tabular view indicate that the number of differences between those two sequences is 1.

Figure 2.2: Top left: select the pairs of sequences where ID1 is 583 (Nicolai's sequence). Top right: select pairs where ID2 is 51. The number of differences between those two sequences is 3.

MC3.3: Signs and symptoms of the Drafa virus are varied and humans react differently to infection. Some mutant strains from the current outbreak have been reported as being worse than others for the patients that come in contact with them.

Identify the top 3 mutations that lead to an increase in symptom severity (a disease characteristic). The mutations involve one or more base substitutions. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39)

A → T, 946, and T → C, 842

A → C, 269

A → G, 223

(positions are 1-based)

Explanation:

We look for the three most common mutations that change symptoms from mild to severe.

Figure 3.1: Selecting mild symptoms before mutation (top left) and severe symptoms after mutation (top right). The largest red bars in the bottom histogram indicate the most common base substitutions found in those pairs: 22GC, 161CG, 223AG, 269AC, 842TC and 946AT.

The highlighted base substitutions in Figure 3.1 are found in mutations that increase symptom severity, but that does not mean that all of them cause an increase. We can brush them one by one and observe the change in symptom severity to find out which have a decisive effect. This procedure is captured in the video and in Figure 3.2. We found that 22GC and 161CG are not decisive in increasing symptom severity, so they have been discarded.

Figure 3.2: All mutations that include 223AG change symptoms from mild to severe.
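Brushing the candidate substitutions one by one corresponds to asking, for each substitution, what share of the pairs containing it change symptoms from mild to severe. The following sketch is one way to express that check. The field names SYMPTOMS1 and SYMPTOMS2, the rating labels and the toy rows are assumptions made for illustration.

```python
def severe_fraction(pairs, substitution):
    """Fraction of pairs containing `substitution` that go from mild to severe.

    Returns None if no pair contains the substitution.
    """
    containing = [p for p in pairs if substitution in p["DIFFBASES"]]
    if not containing:
        return None
    hits = sum(p["SYMPTOMS1"] == "mild" and p["SYMPTOMS2"] == "severe"
               for p in containing)
    return hits / len(containing)

# Toy pairs with invented substitution labels.
toy_pairs = [
    {"DIFFBASES": {"5AG"}, "SYMPTOMS1": "mild", "SYMPTOMS2": "severe"},
    {"DIFFBASES": {"5AG", "7CT"}, "SYMPTOMS1": "mild", "SYMPTOMS2": "severe"},
    {"DIFFBASES": {"7CT"}, "SYMPTOMS1": "mild", "SYMPTOMS2": "mild"},
]

fractions = {s: severe_fraction(toy_pairs, s) for s in ("5AG", "7CT", "9TT")}
```

A fraction of 1.0 corresponds to the situation in Figure 3.2, where every mutation that includes the substitution increases severity; clearly lower values suggest that the substitution is not decisive.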

MC3.4: Due to the rapid spread of the virus and limited resources, medical personnel would like to focus on treatments and quarantine procedures for the worst of the mutant strains from the current outbreak, not just symptoms as in the previous question. To find the most dangerous viral mutants, experts are monitoring multiple disease characteristics.

Consider each virulence and drug resistance characteristic as equally important. Identify the top 3 mutations that lead to the most dangerous viral strains. The mutations involve one or more base substitutions. In a worst case scenario, a very dangerous strain could cause severe symptoms, have high mortality, cause major complications, exhibit resistance to anti viral drugs, and target high risk groups. For this question, the biological properties of the underlying amino acid sequence patterns are not significant in determining disease characteristics.

For each mutation provide the base substitutions and their position in the sequence (left to right) where the base substitutions occurred. For example,

C → G, 456 (C changed to G at position 456)

G → A, 513 and T → A, 907 (G changed to A at position 513 and T changed to A at position 907)

A → G, 39 (A changed to G at position 39).

G → C, 848

T → C, 527

A → C, 269

(positions are 1-based)

Explanation:

Our initial answer to this question (and the previous one) was based on analysis with Jalview. We filtered the sequences with a Python script that removes the columns that are identical in all sequences. The script also prints a mapping from the "reduced" column numbers to the original ones. Strains were sorted by their combined disease characteristics. The most dangerous viral strains, 118, 123 and 501, have four out of five disease characteristics rated "most dangerous". They are shown in the last three lines in Figure 4.1.

Figure 4.1: Viral sequences in Jalview. Each line represents one sequence. Sequences are sorted by their combined disease characteristics. The most dangerous ones are shown at the bottom. The consensus diagram at the bottom indicates the most common bases in each column. Colors (dark blue to white) indicate how often the given base occurs at that position. Lighter colors indicate less frequent mutations.

To find base substitutions that lead to the most dangerous viral strains we need to find bases that appear often in the last three lines but rarely in the other ones. They are highlighted by the yellow ovals in Figure 4.1. Unfortunately, the positions indicated by the small red rectangles (and also displayed in the status bar) refer to the reduced data set, from which the matching columns have been removed. The original position has to be looked up in the mapping printed by the script, which is not very convenient. To explore other consequences of mutations, one needs to change the sorting in the script and start the Jalview session from scratch.
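The column filtering and the position mapping can be sketched as follows. This is an illustrative reconstruction under the assumption of aligned, equal-length sequences, not the script we used; the function name and toy sequences are hypothetical.

```python
def drop_constant_columns(sequences):
    """Remove columns that are identical in all sequences.

    Returns (reduced, mapping), where mapping translates 1-based column
    numbers in the reduced alignment back to 1-based original positions.
    """
    ids = list(sequences)
    length = len(sequences[ids[0]])
    # Keep only the columns in which at least two sequences differ.
    keep = [i for i in range(length)
            if len({sequences[s][i] for s in ids}) > 1]
    reduced = {s: "".join(sequences[s][i] for i in keep) for s in ids}
    mapping = {new + 1: old + 1 for new, old in enumerate(keep)}
    return reduced, mapping

# Toy alignment (invented): columns 2 and 4 are constant.
toy = {"a": "ACGT", "b": "ACTT", "c": "TCGT"}
reduced, mapping = drop_constant_columns(toy)
```

Keeping the mapping as a dictionary lets a reduced position read off the Jalview status bar be translated back with a single lookup.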

We were not happy with the flexibility and interactivity of this procedure. We tried a more interactive solution (already presented in MC3.3), based on computing the base substitutions that lead from the initial sequence to the mutated one in each pair. The disease characteristics after mutation are displayed in a parallel coordinates view. Each axis represents one disease characteristic. The most dangerous strains can be selected by brushing the top of each axis. We expected to create five brushes, and then observe the logical AND of those brushes. This process is captured in Figures 4.2 and 4.3. The selected viral strains are highlighted in the histogram on the right. When the mouse pointer hovers over a bar, the name of that bar is shown below the middle of the histogram. By pointing at the red bars we learn that the most dangerous viral strains are 118, 123 and 501. Four out of the five disease characteristics are rated most dangerous for them, while complications are only minor.

It is worth mentioning that we tried to find strains that cause major complications while being less dangerous in some other characteristic. One such strain is 211: major complications, high mortality, resistant to anti viral drugs, but rated only moderate in the remaining two characteristics. Strains 202 and 705 are also rated most dangerous in three characteristics and moderate in the other two. We consider those strains less dangerous than the ones with four top rated disease characteristics.
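Since all characteristics are considered equally important, the comparison above amounts to counting how many characteristics of a strain carry the top rating. A minimal sketch of that count follows. The key names, the rating labels and the toy strains s1 and s2 are assumptions made for illustration, not the encoding of the challenge data.

```python
# Assumed label of the most dangerous rating for each characteristic.
TOP_RATING = {
    "symptoms": "severe",
    "mortality": "high",
    "complications": "major",
    "drug_resistance": "resistant",
    "at_risk": "high",
}

def top_rated_count(characteristics):
    """Number of characteristics that carry the most dangerous rating."""
    return sum(characteristics.get(name) == top
               for name, top in TOP_RATING.items())

def rank_strains(strains):
    """Sort strain IDs from most to fewest top-rated characteristics."""
    return sorted(strains, key=lambda sid: top_rated_count(strains[sid]),
                  reverse=True)

# Toy strains (invented): s2 has four top ratings, s1 has three.
toy_strains = {
    "s1": {"symptoms": "moderate", "mortality": "high", "complications": "major",
           "drug_resistance": "resistant", "at_risk": "moderate"},
    "s2": {"symptoms": "severe", "mortality": "high", "complications": "minor",
           "drug_resistance": "resistant", "at_risk": "high"},
}
ranking = rank_strains(toy_strains)
```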

Figure 4.2: Top left: mutations that lead to severe symptoms and high mortality are brushed. There is no red line going through major complications, which indicates that there are no strains with severe symptoms, high mortality and major complications. Top right: each bar represents a sequence ID. Bottom: each bar represents a base substitution. The red ones are involved in mutations that lead to the selected sequences.

Figure 4.3: All views show the same data as in Figure 4.2, but in the parallel coordinates strains that are resistant to anti viral drugs and target high risk groups are brushed in addition.

The histogram at the bottom in Figure 4.3 displays the base substitutions that lead to those viral strains. Each bar represents a base substitution. The ones highlighted in red are included in the mutations that lead to one of the three selected viral strains. Now we need to identify the base substitutions that appear the most often within this subset.

This task is not completely intuitive. Each bar of the histogram indicates the number of mutations that include the given base substitution. The red part of a bar indicates the number of mutations that include the given base substitution and lead to one of the most dangerous strains. An entirely red bar indicates that every mutation that includes the given base substitution leads to one of those strains. If half of a bar is red, then half of the mutations that include the given base substitution lead to the selected viral strains. Therefore, we now need to find the bars in this histogram that contain the largest red parts relative to the entire bar.

This is a common pattern in analysis tasks, and ComVis includes a "relative" option in its histograms to facilitate such comparisons. When this option is enabled (see Figure 4.4), all histogram bars become equally long. That causes the histogram to lose its original meaning, but it also makes it possible to directly compare the brushed percentages of the bars.
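The quantity shown by the "relative" option is, for each bar, the brushed count divided by the total count. The sketch below illustrates that ratio. It is not ComVis code; the function name, the toy substitution labels and the selection flags are hypothetical.

```python
from collections import Counter

def relative_histogram(items, selected):
    """For each element, the fraction of items containing it that are brushed.

    items:    one set of elements (base substitutions) per pair
    selected: one boolean per pair, True if the pair is brushed
    """
    total, brushed = Counter(), Counter()
    for elements, is_selected in zip(items, selected):
        total.update(elements)
        if is_selected:
            brushed.update(elements)
    return {element: brushed[element] / total[element] for element in total}

# Toy data (invented): the first two pairs lead to a selected strain.
toy_items = [{"3GC", "8AC"}, {"3GC"}, {"8AC"}, {"8AC", "6TC"}]
toy_selected = [True, True, False, False]
relative = relative_histogram(toy_items, toy_selected)
```

A value of 1.0 corresponds to an entirely red bar and 0.0 to a bar with no red part.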

We can see that two bars, 848GC and 527TC, are completely red: mutations that include those base substitutions always lead to very dangerous viral strains. Half of the 269AC bar is also brushed. These three base substitutions are the ones that lead to the most dangerous strains.

Figure 4.4: Bottom: the "relative" option of the histogram makes it possible to compare the brushed percentages of bars in the histogram. Compare to Figure 4.3.

We can cross-check those results by performing a reverse investigation: selecting all mutations that contain those base substitutions. A part of this procedure is captured in Figure 4.5.

Figure 4.5: Bottom: the base substitution "A → C, 269" is brushed. Top right: this base substitution is found in mutations leading to four viral strains: 99, 118, 123 and 997. Two of those, 118 and 123, are very dangerous. Strains 99 and 997 are less aggressive towards high risk groups.
